-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ILM wait for active shards on rolled index in a separate step #50718
ILM wait for active shards on rolled index in a separate step #50718
Conversation
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM) |
@dakrone what are your thoughts on this? |
I took a brief look. I think maybe we should separate this and support the parameter on the regular Rollover API (right now it's only accessible from the Transport layer I think?). Once that is available. I think splitting the steps on the ILM side sounds like a good idea to me. We need to make sure to default to the current behavior for the Rollover API though, to make sure we don't accidentally make a breaking change. |
@dakrone the regular Rollover API supports `wait_for_active_shards" as a query parameter and the rollover client api supports it as well ( https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/admin/indices/rollover/RolloverRequestBuilder.java#L101 ) Are you thinking of exposing this as a parameter in the ILM Rollover action configuration or something else?
I'm not sure we should do this as we didn't see interest into something like this and it also requires validating which we can't perform (the value must not be larger than the number of primaries + replicas but this is a policy definition that can be applied to any index - and then the templates for the rolled index can make this even more complex) |
Ah no! I didn't realize we had already exposed it (since only the setter was added in the actual change). If we already support it then splitting the two sounds good to me. |
@elasticmachine update branch |
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Andrei, I left a couple of comments!
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/RolloverStep.java
Show resolved
Hide resolved
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java
Show resolved
Hide resolved
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java
Outdated
Show resolved
Hide resolved
@elasticmachine update branch |
@elasticmachine update branch |
// if the rollover was not performed on a write index alias, the alias will be moved to the new index and it will be the only | ||
// index this alias points to | ||
List<IndexMetaData> indices = alias.getIndices(); | ||
assert indices.size() == 1 : "when performing rollover on alias with is_write_index = false the alias must point to only " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assertion doesn't stand in a CCR environment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coming up with the fix by parsing the number from the index name and finding the rolled index by "max number"
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more comments, thanks Andrei
|
||
Alias alias = (Alias) aliasOrIndex; | ||
IndexMetaData aliasWriteIndex = alias.getWriteIndex(); | ||
String rolledIndexName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String rolledIndexName; | |
final String rolledIndexName; |
Alias alias = (Alias) aliasOrIndex; | ||
IndexMetaData aliasWriteIndex = alias.getWriteIndex(); | ||
String rolledIndexName; | ||
String waitForActiveShardsSettingValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String waitForActiveShardsSettingValue; | |
final String waitForActiveShardsSettingValue; |
// Index must have been since deleted | ||
logger.debug("unable to find the index that was rolled over from [{}] as part of lifecycle action [{}]", index.getName(), | ||
getKey().getAction()); | ||
return new Result(false, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return the above log message as part of the "info" ToXContent
object here? Otherwise this could stick in false forever and the user would never know why
*/ | ||
static int parseIndexNameCounter(String indexName) { | ||
int numberIndex = indexName.lastIndexOf("-"); | ||
assert numberIndex != -1 : "no separator '-' found"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove this assert and throw a regular error
static int parseIndexNameCounter(String indexName) { | ||
int numberIndex = indexName.lastIndexOf("-"); | ||
assert numberIndex != -1 : "no separator '-' found"; | ||
return Integer.parseInt(indexName.substring(numberIndex + 1, indexName.endsWith(">") ? indexName.length() - 1 : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should catch the NumberFormatException
and format it into a nicer, more human readable exception. It's possible to hit this if someone were to take an index foo-000003
and snapshot it, then restore it with a different name (like foo-000003-restored
for example), this would flip out but it wouldn't be clear why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've kept the exception handling light here as this step follows closely the RolloverStep
(which would not succeed in the example you illustrated as the source index would not match the ^.*-\d+$
pattern). Its visibility is package default solely for testing purposes.
I guess renames and such could occur between steps, so I'll add the exception handling and remove the assumptions.
static final ParseField TARGET_ACTIVE_SHARDS_COUNT = new ParseField("target_active_shards_count"); | ||
static final ParseField ENOUGH_SHARDS_ACTIVE = new ParseField("enough_shards_active"); | ||
static final ParseField MESSAGE = new ParseField("message"); | ||
static final ConstructingObjectParser<Info, Void> PARSER = new ConstructingObjectParser<>("wait_for_active_shards_step_info", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this PARSER
is never actually used?
this.enoughShardsActive = enoughShardsActive; | ||
|
||
if (enoughShardsActive) { | ||
message = "The target of [" + targetActiveShardsCount + "] are active. Don't need to wait anymore."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, but we should keep these lowercase, and I don't think we need a trailing period (these are more like error messages than sentences)
if (enoughShardsActive) { | ||
message = "The target of [" + targetActiveShardsCount + "] are active. Don't need to wait anymore."; | ||
} else { | ||
message = "Waiting for [" + targetActiveShardsCount + "] shards to become active, but only [" + currentActiveShardsCount + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same about lowercasing/trailing-period here.
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Andrei!
…c#50718) After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the index.write.wait_for_active_shards setting which might be present in a template, or otherwise in the default case, for the primaries to become active). This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and, the RolloverStep might not complete successfully due to this inherent transient situation, albeit the rolled index having been created. (cherry picked from commit 457a92f) Signed-off-by: Andrei Dan <[email protected]>
#51296) After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the index.write.wait_for_active_shards setting which might be present in a template, or otherwise in the default case, for the primaries to become active). This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and, the RolloverStep might not complete successfully due to this inherent transient situation, albeit the rolled index having been created. (cherry picked from commit 457a92f) Signed-off-by: Andrei Dan <[email protected]>
After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the
index.write.wait_for_active_shards
setting which might be present in a template, or otherwise in the default case, for the primaries to become active).This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and the
RolloverStep
might not complete successfully due to this inherent transient situation, albeit the rolled index having been created.Relates to #48183 and #44135